Predicting risk of the subsequent early pregnancy loss in women with recurrent pregnancy loss based on preconception data

Background For women who have experienced recurrent pregnancy loss (RPL), it is crucial not only to treat them but also to evaluate the risk of recurrence. This study aimed to develop a risk prediction model for subsequent early pregnancy loss (EPL) in women with RPL based on preconception data.
Methods A prospective, dynamic population cohort study was carried out at the Second Hospital of Lanzhou University. From September 2019 to December 2022, a total of 1050 non-pregnant women with RPL participated. By December 2023, 605 women had subsequent pregnancy outcomes and were randomly divided into training and validation groups at a 3:1 ratio. In the training group, univariable screening was performed against the subsequent EPL outcome. The least absolute shrinkage and selection operator (LASSO) regression and multivariate logistic regression were each used to select variables. Prediction models for subsequent EPL were constructed using a generalized linear model (GLM), gradient boosting machine (GBM), random forest (RF), and deep learning (DP). Models built on the variables selected by LASSO regression and by multivariate logistic regression were then compared using the best-performing algorithm. The AUC, calibration curve, and decision curve analysis (DCA) were used to assess the performance of the best model, which was then validated in the validation group. Finally, a nomogram was established based on the best predictive features.
Results In the training group, the GBM model achieved the best performance, with the highest AUC (0.805). The AUCs of the models built on the variables screened by LASSO regression (16 variables) and logistic regression (9 variables) did not differ significantly (AUC: 0.805 vs. 0.777, P = 0.1498). The 9-variable model showed good discrimination in the validation group, with an AUC of 0.781 (95%CI 0.702, 0.843). The DCA showed that the model performed well and was feasible for making beneficial clinical decisions. Calibration curves showed good agreement between the values predicted by the model and the actual values (Hosmer–Lemeshow statistic 7.427, P = 0.505).
Conclusions Predicting subsequent EPL in RPL patients using the GBM model has important clinical implications. Future prospective studies are needed to verify its clinical applicability.
Trial registration This study was registered in the Chinese Clinical Trial Registry with the registration number ChiCTR2000039414 (27/10/2020).
Supplementary Information The online version contains supplementary material available at 10.1186/s12905-024-03206-9.


Background
Pregnancy loss (PL), also referred to as miscarriage or spontaneous abortion, is defined as the spontaneous demise of a pregnancy before the fetus reaches viability (from the time of conception until 24 weeks of gestation). It is one of the common health problems in women of childbearing age and affects 10-15% of clinically recognized pregnancies [1]. Early PL (EPL) is pregnancy loss before 10 weeks of gestational age and accounts for more than 80% of all PLs [2,3]. Two or more PLs, known as recurrent pregnancy loss (RPL), occur in 1-3% of women of childbearing age. Studies have identified various risk factors related to PL, including age, BMI, ethnicity, previous miscarriages, uterine anatomical abnormalities, chromosomal abnormalities, infection, immune dysfunction, endocrine disturbance, and unhealthy lifestyle [4,5], but approximately 60% of RPL cases remain unexplained and are referred to as unexplained RPL (URPL) [3].
RPL greatly affects the physical and mental health of couples of childbearing age [6]. Women with a history of RPL show more psychological problems during their subsequent pregnancy. Notably, couples must deal with the cumulative emotional effects of subsequent RPL [7]. Moreover, some patients seek help at the hospital only when they have a pregnancy-related concern (bleeding, abdominal pain, or worsening anxiety due to prior miscarriages or ectopic pregnancy), at which point the risk of pregnancy loss is already increased. In addition, clinicians use clinical and demographic information to predict pregnancy outcomes after a patient becomes pregnant, but some early pregnancy losses are inevitable. For women who have experienced RPL, it is important not only to diagnose and treat them but also to evaluate the risk of recurrence, which can reduce subsequent PL rates.
The risk of RPL increases with the number of previous pregnancy losses: the incidence of pregnancy loss was only 11.6% in women without a history of pregnancy loss, whereas in women with a history of one, two, and three or more pregnancy losses, the probability of subsequent pregnancy loss was 19.8%, 27.7%, and 41.9%, respectively [8]. A population-based study found the lowest risk of PL (9.8%) in women aged 25-29 years; the risk increases in women aged 30-35 years and then rises steeply to 33.2% in women aged 40-44 years [8]. The number of previous PLs was another independent risk factor for RPL [9]. Genetic factors may also be involved in the risk of miscarriage. One large genome-wide association study identified four distinct susceptibility loci for sporadic and recurrent PL that have a role in progesterone production, placentation, and gonadotropin regulation [10]. One study found that women experiencing bleeding without nausea between 6 and 8 weeks' gestation had an increased risk of clinical pregnancy loss, but bleeding and nausea were not predictive risk factors of clinical pregnancy loss prior to 6 weeks' gestation [11]. A meta-analysis found that early pregnancy ultrasound markers, including fetal bradycardia, crown-rump length, intrauterine hematoma, and mean gestational sac diameter minus could predict miscarriage in women with a diagnosed viable intrauterine pregnancy [12]. In recent years, some novel markers of immune tolerance and angiogenesis in maternal blood have been reported as potential RPL predictors, including the immune tolerance proteins galectin-9 (Gal-9) and interleukin (IL)-4, and the angiogenesis proteins vascular endothelial growth factors (VEGF) A, C, and D [13].
Some of these risk factors are limited by their late appearance or by limited availability at the time of assessment, which makes a comprehensive assessment of the risk of subsequent RPL difficult. Furthermore, most RPL risk assessments use traditional regression models (e.g., logistic regression), which implicitly assume that each risk factor is linearly related to PL [14]. This may ignore complex non-linear relationships and interactions among the many risk factors, so predictive performance is often suboptimal. Therefore, it is necessary to explore an effective model for predicting subsequent EPL in RPL patients.
Recently, machine learning (ML) methods have been reported to show powerful self-learning ability with improved prediction accuracy [15,16], and they have been successfully applied to diagnosing diseases and predicting clinical outcomes, such as in vitro fertilization treatment [15], RPL [16], postpartum hemorrhage [17], and other pathological events of pregnancy [18]. Furthermore, the nomogram is a simple statistical visualization tool that is widely used to predict the occurrence of disease. In this study, we developed and validated a machine learning prediction model based on preconception demographic information, reproductive history, and clinical blood parameters at admission to identify the risk of subsequent EPL in RPL patients.

Participants
The study population was drawn from a prospective, dynamic cohort carried out at the Department of Reproductive Medicine, Second Hospital of Lanzhou University [19]. The cohort began in September 2019 and enrolled 1050 nonpregnant RPL patients through December 2022. The inclusion criteria were: (1) a history of at least two PLs meeting the ESHRE diagnostic criteria; (2) age 18-42 years. The follow-up period ended in December 2023. The exclusion criteria were: (1) patients who were lost to follow-up or who were not yet pregnant; (2) a subsequent pregnancy outcome of ectopic pregnancy, hydatidiform mole, or dysplasia, or a current pregnancy of < 10 weeks; (3) a subsequent pregnancy achieved by assisted reproductive technology, or a twin pregnancy. This study was approved by the Ethics Committee of Lanzhou University Second Hospital (Ethical Approval Number: 2019 A-231). All patients provided written informed consent.

Predictive variables
Demographic information (including age, height, weight, education, ethnicity, menarche, menstrual cycle, and pelvic surgery) and reproductive history (including total number of pregnancies, number of pregnancy losses, induced abortion, live birth, and pregnancy type) were obtained from outpatient medical records. Body mass index (BMI) was calculated as weight in kilograms divided by the square of height in meters. Preconception treatments and subsequent pregnancy outcomes were obtained from follow-up. Each patient was followed up through the electronic medical record system and by telephone every 6 months after the first visit to track pregnancy status, most recently in December 2023. Blood samples were obtained at the initial visit, in the nonpregnant state, in the morning after overnight fasting, and were tested within an hour at our hospital according to the manufacturers' standard protocols; the blood parameters comprised 50 indicators. The demographic information and blood test indicators are presented in Table 1.
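The BMI definition above can be written as a short function. This is a minimal Python sketch for illustration only; the function name and the example patient values are hypothetical and do not come from the study data.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight in kilograms divided by the square of height in meters."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical patient (not from the cohort): 60 kg, 1.60 m
print(round(body_mass_index(60.0, 1.60), 1))  # 23.4
```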

Outcomes
The primary outcomes of this analysis were subsequent EPL and ongoing pregnancy (OP). EPL was defined as pregnancy loss before 10 weeks of gestational age, including biochemical pregnancy. OP was defined as pregnancy continuing beyond 10 weeks of gestational age.

Statistical analyses
Continuous variables are described as mean ± standard deviation or median (interquartile range), and categorical variables as percentages. The independent-sample t test, Mann-Whitney-Wilcoxon test, chi-square test, and Fisher's exact test were used as appropriate. Multiple imputation was performed for the few variables with missing values (details of the statistical methods for each variable and the number and percentage of missing data are presented in Supplementary Table 1). We generated five data sets by multiple imputation, and sensitivity analysis showed that these five data sets were not significantly different from the original data; the results are presented in Supplementary Table 2. All analyses were performed using Empower(R) (www.empowerstats.com, X&Y Solutions, Inc., Boston, MA) and R (http://www.R-project.org).

Variable selection
The dataset of the RPL women was randomly split into development (75%) and validation (25%) groups. Data for 63 pre-pregnancy variables were obtained from patient self-reports and electronic medical records. First, to create an efficient approach for clinical practice with fewer redundant variables, we performed the independent-sample t test, Mann-Whitney-Wilcoxon test, chi-square test, or Fisher's exact test as appropriate. Variables with P < 0.05 were then entered into the least absolute shrinkage and selection operator (LASSO) logistic regression algorithm and multivariate regression analysis in the training group to select relevant features.
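The random 75%/25% split can be illustrated with a minimal Python sketch. The function name, the seed, and the use of integer patient identifiers are assumptions made for illustration; the source does not describe how the randomization was implemented. With 605 records, rounding 75% gives the group sizes of 454 and 151 reported in the Results.

```python
import random


def split_train_validation(ids, train_fraction=0.75, seed=2023):
    """Randomly split identifiers into training and validation lists."""
    shuffled = list(ids)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers for the 605 women with a known pregnancy outcome
patient_ids = range(1, 606)
train_ids, valid_ids = split_train_validation(patient_ids)
print(len(train_ids), len(valid_ids))  # 454 151
```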

Model training, evaluation, and validation
Four models, namely the generalized linear model (GLM), gradient boosting machine (GBM), random forest (RF), and deep learning (DP), were fitted in the training group using the features selected by LASSO regression. After the models were established, we compared them using the area under the ROC curve (AUC), area under the precision-recall curve (AUCPR), logloss, mean per-class error, root mean square error (RMSE), and mean square error (MSE). The model with the largest AUC was selected as the best model. Next, models built on the variables selected by LASSO regression and by multivariate logistic regression were compared using the best algorithm in the training group. The AUC, calibration curve (Hosmer-Lemeshow test), and decision curve analysis (DCA) were used to assess the predictive performance and clinical utility of the best model. We further performed an internal validation of the developed prediction model using the validation group. Finally, based on the best predictive features, a nomogram was established to take advantage of fitting a line to a non-linear relationship in predicting subsequent EPL.
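Several of the comparison metrics named above have simple closed-form definitions. The following is a minimal pure-Python sketch under standard textbook definitions, not the software actually used in the study; the outcome labels and predicted probabilities are made-up toy values.

```python
import math


def auc(y_true, y_prob):
    """AUC as the probability that a random positive case is ranked above
    a random negative case (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the observed binary outcomes."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def mse(y_true, y_prob):
    """Mean squared difference between outcome (0/1) and predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def rmse(y_true, y_prob):
    """Square root of the mean squared error."""
    return math.sqrt(mse(y_true, y_prob))


# Toy data (hypothetical): 1 = EPL, 0 = ongoing pregnancy
y = [1, 0, 1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
print(auc(y, p), log_loss(y, p), mse(y, p), rmse(y, p))
```

In this toy example, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9.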

Baseline characteristics
Ultimately, 605 eligible RPL patients were enrolled in this study and randomly divided into training and validation groups at a 3:1 ratio, with 454 patients in the training group and 151 patients in the validation group. A flow chart of the process is presented in Fig. 1. In the multivariate logistic regression, the 9 variables age, BMI, PLs, induced abortion, ACA, HCY, IgM, LHR, and PNR were identified as independent predictors of subsequent EPL in women with prior RPL (Table 3).

Construction and validation of prediction models
The prediction model for subsequent EPL was constructed in the training group using GLM, GBM, RF, and DP with the 16 variables selected by LASSO logistic regression. The performance of the four models, including AUC, AUCPR, logloss, mean per-class error, RMSE, and MSE, is shown in Table 4. Overall, the GBM model achieved the best performance, with the highest AUC (0.805) and AUCPR (0.783).
We then used the 9 variables further screened by multivariate regression to construct a GBM prediction model in the training group and compared the prediction performance of the 16-variable and 9-variable models. Using 9 variables did not significantly reduce prediction performance in the training group: the AUCs of the 16-variable and 9-variable models were 0.805 (95%CI 0.716, 0.878) and 0.777 (95%CI 0.690, 0.853), respectively, P > 0.05 (Table 5; Fig. 3A-C). In the DCA, the threshold probability was 28% with a corresponding net benefit of 0.44 for the 16-variable model, and 23% with a corresponding net benefit of 0.48 for the 9-variable model. This indicates that both models provide a benefit, with no significant difference between them (Fig. 3D). Figure 3E-F shows the calibration curves, which suggest that the predictions of subsequent EPL by the 16-variable and 9-variable models were essentially accurate; the Hosmer-Lemeshow test P values of the two models were 0.607 and 0.559, respectively. Figure 3G-H shows the 95% CIs in the training group for the 16-variable and 9-variable models.
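For readers unfamiliar with decision curve analysis, the net benefit at a threshold probability pt is conventionally computed as TP/n − (FP/n) × pt/(1 − pt), and the standardized net benefit divides this by the outcome prevalence. The sketch below assumes these standard definitions; the source does not state the formula or software used, and the labels and probabilities are made-up toy values.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: every subject is treated as if the event will occur."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


def standardized_net_benefit(y_true, y_prob, threshold):
    """Net benefit divided by the outcome prevalence."""
    prevalence = sum(y_true) / len(y_true)
    return net_benefit(y_true, y_prob, threshold) / prevalence


# Toy data (hypothetical): 1 = EPL, 0 = ongoing pregnancy
y = [1, 0, 1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
for pt in (0.2, 0.5):
    print(pt, net_benefit(y, p, pt), net_benefit_treat_all(y, pt))
```

The "treat none" reference strategy has a net benefit of zero at every threshold, so a model is useful at a given threshold only if its curve lies above both reference lines.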
Because the 16-variable model did not significantly improve predictive power over the 9-variable model, we used the 9-variable model for internal validation in the validation group. The predictive model showed good discrimination in the validation group, with an AUC of 0.781 (95%CI 0.702, 0.843) (Fig. 4A). The DCA showed that the model performed well and was feasible for making beneficial clinical decisions (Fig. 4B). Calibration curves showed good agreement between the values predicted by the model and the actual values (Hosmer-Lemeshow statistic 7.427, P = 0.505) (Fig. 4C).
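The Hosmer-Lemeshow statistic compares observed and expected event counts within groups of increasing predicted risk. The following is a minimal sketch assuming the common setup of equal-size risk groups and g − 2 degrees of freedom; the number of groups, the degrees of freedom, and the software used in the study are not stated in the source, and the data here are simulated. The chi-square tail helper uses a closed form that is valid only for even degrees of freedom.

```python
import math
import random


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Return (statistic, degrees of freedom) over equal-size risk groups."""
    pairs = sorted(zip(y_prob, y_true), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(prob for prob, _ in chunk)
        observed = sum(outcome for _, outcome in chunk)
        mean_p = expected / len(chunk)
        if 0.0 < mean_p < 1.0:
            statistic += (observed - expected) ** 2 / (
                len(chunk) * mean_p * (1.0 - mean_p)
            )
    return statistic, groups - 2


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Simulated, well-calibrated predictions (hypothetical data)
rng = random.Random(0)
probs = [rng.random() for _ in range(200)]
outcomes = [1 if rng.random() < prob else 0 for prob in probs]
stat, df = hosmer_lemeshow(outcomes, probs)
print(stat, df, chi2_sf_even_df(stat, df))
```

A large P value indicates no detectable lack of fit, which is how the calibration results above are interpreted.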
Finally, the 9 variables were selected for nomogram presentation in Fig. 5. A total score was obtained by adding matching points for each parameter in the

Discussion
The factors affecting the pregnancy outcome of RPL patients are complex and diverse. It is worth mentioning, however, that a comprehensive review of guidelines states that genetic thrombophilia, vaginal infections, and immunologic and male factors of infertility are not recommended as part of routine RPL investigations, and that there is also some controversy about the need for ovarian reserve testing, screening for thyroid disease, diabetes, or hyperhomocysteinemia, measurement of prolactin levels, and endometrial biopsy [20]. In our study, we first compared the accuracy of multiple machine learning algorithms (GLM, GBM, RF, and DP) in predicting subsequent EPL in patients with RPL, using demographic information and multiple clinical parameters collected before pregnancy. Subsequently, using the best algorithm, we found no significant difference between the prediction models constructed with 16 variables and with 9 variables. The 9-variable model showed good discrimination in the validation group, and the DCA showed that the model performed well and was feasible for making beneficial clinical decisions. Calibration curves showed good agreement between the values predicted by the model and the actual values (Hosmer-Lemeshow statistic 7.427, P = 0.505). The 9 variables are age, BMI, PLs, induced abortion, ACA, HCY, IgM, LHR, and PNR. Our study moves the risk assessment of subsequent EPL in women with RPL forward to the period before pregnancy, which has very important clinical implications. The association between female age and RPL has been consistently demonstrated in several studies. Studies have shown that, to have at least a 90% chance of a two-child family, couples should start trying to conceive when the woman is 31 years or younger, or no later than 27 if IVF is not feasible. For a one-child family, couples should start trying before the woman is 32, or 35 if IVF is an option [21]. The BMI threshold associated with pregnancy risk also varies between studies. Zhang et al. reported that a BMI of 24.0 kg/m² or greater was associated with an increased risk of RPL, whereas Lo and colleagues demonstrated that maternal obesity (BMI ≥ 30.0 kg/m²) significantly increased the risk of miscarriage in couples with unexplained RPL, with no increased risk in overweight or underweight women [22,23]. The conflicting results may be due to differences in study design and varying definitions of RPL and BMI ranges. A systematic review and meta-analysis found that the BMI of women with a history of RPL is significantly higher than that of controls (mean difference 0.7 kg/m² [95% CI 0.2-1.3]). It is recommended that BMI be discussed as part of preconception and abortion counseling [24].
We found that, for patients with RPL, previous induced abortion increased the risk of RPL recurrence. In contrast, previous studies have found that the risk of spontaneous abortion decreases as the number of induced abortions increases, which is not consistent with our results. A possible reason is that the reference population in that work was derived from all female workers in the Jinchang cohort in China, most of whom had normal reproductive function [25]. In addition, recent studies have found that, for IVF patients, prior termination, miscarriage, ectopic pregnancy, or live birth does not compromise subsequent live birth and perinatal outcomes [26]. We also found that the risk of RPL recurrence increased with the number of previous miscarriages. Some studies have found that ≥ 4 previous miscarriages increase the cumulative clinical pregnancy loss rate and reduce the cumulative live birth rate in young women [27], whereas other studies found that the risk of further miscarriage following two or three RPLs is similar [28]. A nested cohort study demonstrated that the number of prior miscarriages was a determinant both of time to live birth and of cumulative incidence of live birth [29,30]. It is worth noting that, for secondary unexplained RPL, only consecutive pregnancy losses after the birth influenced the subsequent prognosis, while the number of losses prior to the birth did not affect the prognosis in the next pregnancy [31].
The negative effects of elevated HCY levels on pregnancy are well known. Elevated HCY is associated with a variety of pregnancy complications, such as preeclampsia (PE), early PL (EPL), placental abruption (PA), intrauterine growth restriction (IUGR), and venous thrombosis [32]. Approximately one third of spontaneous abortions before 20 weeks' gestation are associated with elevated HCY levels [33]. A longitudinal study in a Chinese population explored the reference intervals of HCY in the three trimesters of pregnancy, which provides a basis for the management and detection of HCY in Chinese women during pregnancy [34]. However, most studies on the relationship between HCY and pregnancy diseases have focused on the first trimester, ignoring the effect of HCY before pregnancy. Research from our team found that, for women with a previous miscarriage, HCY can increase uterine artery resistance in the nonpregnant state and is associated with the abortion rate of the subsequent pregnancy [35]. Based on pre-pregnancy data, the present study found that pre-pregnancy HCY plays an important role in the recurrence of RPL, suggesting that measuring HCY levels in women trying to become pregnant may help prevent the recurrence of RPL.
Antiphospholipid antibodies (aPLs) are leading causes of adverse pregnancy outcomes (APOs). A cluster analysis found that patients with triple antibody positivity or high-risk lupus characteristics were prone to gestational hypertension and premature delivery, whereas isolated LA or ACA/aβ2GPI positivity was more frequently associated with early-stage fetal loss [36]. Takeshita et al. found that the only risk factor for persistently positive ACA antibodies is a high antibody titer at the initial test. When the ACA antibody titer at the initial test exceeds the cut-off value (ACA-IgG antibodies > 15 U/mL and ACA-IgM antibodies > 11 U/mL), treatment can be initiated immediately [37]. The complement system has attracted attention as a potential mediator of the pathogenic mechanisms induced by aPL. Complement C3 and C4 serum levels have been assessed in several cohorts of pregnant patients with APS and/or aPL, but these studies have yielded inconsistent results: some found a correlation, while others did not reveal a prognostic role for complement in relation to pregnancy morbidity among aPL-positive women [38][39][40]. Our study reconfirmed the important effect of positive aPLs and C4 on the outcome of the next pregnancy in RPL patients. However, a meta-analysis found that the presence of positive aPL neither decreased the clinical pregnancy rate and live birth rate nor increased the miscarriage rate in women undergoing IVF, which differs from the prevailing opinion in clinical practice [41].
Glucose and lipid metabolism levels are not included in the routine screening program for RPL patients in the current guidelines. In this study, 2-hour postprandial glucose and 2-hour postprandial insulin were significantly elevated in women with subsequent EPL; however, there was no significant difference in FBG and FINS between the EPL and OP groups. One study found that the 2-h postprandial glycemia level is more precise than fasting glycemia for type 2 diabetes [42]. As early as 10 years ago, researchers found that the 1-, 2-, and 3-hour plasma glucose and insulin levels were significantly higher in women with RPL (more than 2 PLs) than in controls [43]. Numerous studies associate abnormal glucose metabolism in the endometrium with a higher risk of adverse pregnancy outcomes [44]. Furthermore, several studies have linked altered lipid levels to a higher risk of adverse pregnancy outcomes, which is visible in patients with recurrent miscarriage (RM) [45,46]. HDL concentrations were lower in women with RM and, together with abdominal obesity, were the most frequent components of the RM profile [47]. Subsequently, the study by Depciuch et al. reconfirmed that changes in the metabolomic and lipidomic pathways may be potential risk factors as well as therapeutic targets for RM [48]. In addition, a retrospective cohort study found that serum lipid levels were associated with treatment outcomes in women undergoing assisted reproduction: higher HDL-C was associated with greater numbers of oocytes retrieved, higher live birth rates, and lower miscarriage rates [49]. Moreover, in long-term follow-up, researchers found that women with a history of PL experienced more prediabetes (50% vs. 45.5%), diabetes (28.9% vs. 21.3%), and metabolic syndrome (70% vs. 60.1%) than women without such a history [50]. This also reveals the interaction between metabolism and RPL.
Fig. 4 The ROC, DCA, and calibration plots in the validation group for the 9-variable model using GBM. A ROC curve for the GBM model in the validation group. B DCA for the GBM model in the validation group. The y-axis represents the standardized net benefit (sNB) and the x-axis represents the threshold probability. The cost-benefit ratio is also shown below the DCA. The black line represents the net benefit when no subject is assumed to experience EPL, and the gray line is the net benefit at each risk threshold when all subjects are assumed to experience EPL. The red line is the net benefit of the risk probabilities estimated by the model at each risk threshold. C Calibration curve for the GBM model in the validation group
Fig. 5 Nomogram for the prediction of subsequent EPL in RPL patients with 9 variables: age, BMI, PLs, induced abortion, ACA, HCY, IgM, LHR, and PNR
The role of systemic inflammatory reactions in the pathogenesis of EPL has been confirmed in several studies [51][52][53][54]. Inflammatory markers derived from the complete blood count (CBC), such as the platelet to lymphocyte ratio (PLR), neutrophil to lymphocyte ratio (NLR), neutrophil to monocyte ratio (NMR), lymphocyte to monocyte ratio (LMR), and platelet to neutrophil ratio (PNR), are readily available [55]. In this study, we found that elevated PNR was a risk factor for EPL in the next pregnancy of RPL patients, while no statistically significant differences were found in the remaining inflammatory markers between the EPL and OP groups. However, the direction of the association between PNR and outcome is inconsistent across diseases. In reproductive events, lower PNR was associated with early natural menopause [56]. For mothers with hypertensive disorders of pregnancy, neonatal Apgar scores at 1 and 5 min after birth were positively associated with PNR [57]. For patients with ST-elevation myocardial infarction, a nonlinear relationship was found between PNR and major adverse cardiovascular events, which were positively associated with PNR once PNR exceeded a threshold [58]. For patients with ovarian cancer, PNR was an independent prognostic indicator of poor relapse-free survival [59].
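The CBC-derived ratios listed above are simple quotients of cell counts. The sketch below is illustrative only: the function name and the example counts are hypothetical, and all counts are assumed to be expressed in the same unit (for example, 10^9 cells/L).

```python
def cbc_ratios(platelets, neutrophils, lymphocytes, monocytes):
    """Return the inflammatory ratios named in the text from CBC counts."""
    counts = {
        "platelets": platelets,
        "neutrophils": neutrophils,
        "lymphocytes": lymphocytes,
        "monocytes": monocytes,
    }
    for name, value in counts.items():
        if value <= 0:
            raise ValueError(f"{name} count must be positive")
    return {
        "PLR": platelets / lymphocytes,   # platelet to lymphocyte ratio
        "NLR": neutrophils / lymphocytes, # neutrophil to lymphocyte ratio
        "NMR": neutrophils / monocytes,   # neutrophil to monocyte ratio
        "LMR": lymphocytes / monocytes,   # lymphocyte to monocyte ratio
        "PNR": platelets / neutrophils,   # platelet to neutrophil ratio
    }


# Hypothetical counts, not taken from the cohort
ratios = cbc_ratios(platelets=250.0, neutrophils=4.0, lymphocytes=2.0, monocytes=0.5)
print(ratios["PNR"])  # 62.5
```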
However, our study still has some limitations. First, there are some missing data, and the characteristics of RPL women need to be evaluated more comprehensively, for example by recording blood pressure, waist circumference, and hip circumference, as well as more detailed reproductive history and examination results. Second, the inflammatory and immune status of the endometrium is thought to be closely related to RPL, but such data were lacking in this study. Finally, this is a single-center study without external validation, and the model presented here needs further study with multi-center clinical data. To address these shortcomings, our team is carrying out a cohort study of RPL patients that observes and records in detail the whole process from the first visit to the subsequent pregnancy outcome, in order to provide more accurate clinical treatment for RPL patients.

Conclusions
Our study introduces the use of pre-pregnancy demographic data and clinical laboratory indicators to predict subsequent EPL in RPL patients, which has important clinical implications. Age, BMI, PLs, induced abortion, ACA, HCY, IgM, LHR, and PNR are the key factors affecting subsequent EPL.

Fig. 2 Clinical feature selection using the least absolute shrinkage and selection operator (LASSO). A Tuning parameter (lambda) selection in the LASSO model using 10-fold cross-validation via the minimum criterion. B LASSO coefficient profiles of the 16 clinical features. A coefficient profile plot was produced against the log(λ) sequence

Fig. 3 The ROC, DCA, calibration plots, and 95% confidence intervals in the training group for the 16-variable and 9-variable models using GBM. Model 1: 16-variable prediction model; Model 2: 9-variable prediction model. A-B ROC curves for models 1 and 2. C Comparison of ROC curves between model 1 and model 2. D DCA for model 1 and model 2. The y-axis represents the standardized net benefit (sNB) and the x-axis represents the threshold probability. The cost-benefit ratio is also shown below the DCA. The black line represents the net benefit when no subject is assumed to experience EPL, and the gray line is the net benefit at each risk threshold when all subjects are assumed to experience EPL. The blue and red lines are the net benefits of the risk probabilities estimated by models 1 and 2 at each risk threshold. E-F Calibration curves for model 1 and model 2. G-H The 95% confidence intervals for model 1 and model 2

Table 1
Baseline demographic information and blood test indicators of RPL patients in the training and validation groups

Table 2
Baseline demographic information and blood test indicators of RPL patients with subsequent EPL and OP in the training group

Table 3
Risk factors for subsequent EPL in women with RPL identified by multivariate logistic regression
Abbreviations: BMI Body mass index, IgM Immunoglobulin M, C4 Complement C4, HCY Homocysteine, 2 h-BG 2-hour postprandial blood glucose, 2 h-INS 2-hour postprandial insulin, HDL High-density lipoprotein, CHR Cholesterol to high-density lipoprotein ratio, LHR Low-density lipoprotein to high-density lipoprotein ratio, PNR Platelet to neutrophil ratio, TPO-Ab Thyroid peroxidase antibodies, ACA Anticardiolipin antibody, β2GP1 β2-glycoprotein 1, LA Lupus anticoagulant, OR Odds ratio, CI Confidence interval

Table 4
Performance results of the four models in the training group based on 16 variables
Abbreviations: GLM Generalized linear model, GBM Gradient boosting machine, RF Random forest, DP Deep learning